(54) Title: DIGITAL PROCESSOR FOR SIMULATING OPERATION OF A PARALLEL PROCESSING ARRAY



(57) Abstract 

A digital processor for simulating operation of a parallel processing
array incorporates digital processing units (P1 to P8) communicating
data to one another via addresses in memories (M0 to M8) and registers
(R11 to R41). Each processing unit (e.g. P1) is programmed to input data
and execute a computation involving updating of a stored coefficient
followed by data output. Each computation involves use of a respective
set of data addresses for data input and output, and each processing
unit (e.g. P1) is programmed with a list of such sets employed in
succession by that unit. On reaching the end of its list, the processing
unit (e.g. P1) repeats it. Each address set is associated with a
conceptual internal cell location in the simulated array (10), and each
list is associated with a respective sub-array of the simulated array
(10). Data is input cyclically to the processor (40) via input/output
ports (I/O5 to I/O8) of some of the processing units (P5 to P8). Each
processing unit (e.g. P1) executes its list of address sets within a
cycle at a rate of one address set per subcycle. At the end of its list,
each of the processing units (P1 to P8) has executed the functions
associated with a conceptual respective sub-array of simulated cells
(12), and the processor (40) as a whole has simulated operation of one
cycle of a systolic array (10). Repeating the address set lists with
further processor input provides successive simulated array cycles.

DIGITAL PROCESSOR FOR SIMULATING OPERATION OF A PARALLEL
PROCESSING ARRAY

This invention relates to a digital processor for simulating operation of a parallel 
processing array, such as a systolic array.

The field of parallel processing arrays was developed to overcome a well-known 
problem in conventional digital computers, the "Von Neumann bottleneck". This 
problem arises from the serial nature of conventional computers, in which 
programme steps or instructions are executed one at a time and in succession.
This means that the computer operating speed is restricted to the rate at which 
its central processing unit executes individual instructions. 

To overcome the operating speed problem of conventional computers, parallel
processors based on systolic array architectures have been developed. One such
is disclosed in British Patent No. GB 2,151,378B, which corresponds to United
States Patent No. 4,727,503. It consists of a triangular array of internal and
boundary cells. The boundary cells form the array diagonal and are
interconnected via delay latches. The internal cells are in above-diagonal
locations. The array includes nearest-neighbour cell interconnection lines defining
rows and columns of cells. The cells are activated cyclically by a common
system clock. Signal flow is along the rows and down the columns at the rate
of one cell per clock cycle. Each cell executes a computational function on
each clock cycle employing data input to the array and/or received from
neighbouring cells. Computation results are output to neighbouring cells to
provide input for subsequent computations. The computations of individual cells
are comparatively simple, but the systolic array as a whole performs a much
more complex calculation, and does so in a recursive manner at potentially high
speed. In effect, the array subdivides the complex calculations into a series of
much smaller cascaded calculations which are distributed over the array processing
cells. An external control computer is not required. The cells are
clock-activated; each operates on every clock cycle. The maximum clock
frequency or rate of processing is limited only by the rate at which the slowest
individual cell can carry out its comparatively simple processing function. This
results in a high degree of parallelism, with potentially high speed if fast
processing cells are employed. The "bottleneck" of conventional
computers is avoided.

The disadvantage of prior art systolic arrays is that, in all but the 
simplest problems, large numbers of cells are required. As will be 
described later in more detail, a prior art triangular array for dealing 
with an n-dimensional computation requires in the order of $n^2/2$ internal
cells. In consequence, the number of internal cells required grows as 
the square of the number of dimensions of the computation. The number 
of boundary cells grows only linearly with number of dimensions. One 
important application of a triangular systolic array relates to 
processing signals from an array of sensors, such as a phased array of 
radar antennas. Typical radar phased arrays incorporate in the 
region of one thousand or more antennas, and a systolic array to process
the antenna signals would require of the order of one million
processing cells. Each cell requires the processing functions and
connectivity capabilities of a transputer to enable communications 
between neighbouring cells. Special purpose integrated circuits could 
also be used, in which "cells" constitute respective areas of a silicon 
chip or wafer. Since transputers are priced in excess of £100 each, 
the cost of a systolic array would be prohibitively expensive for radar 
phased array purposes. It is also prohibitively expensive for many 
other signal processing applications characterised by high 
dimensionality.

There is a need for digital processing apparatus which has a degree of 
parallelism to overcome conventional computer disadvantages, but which 
requires fewer processing cells than a prior art systolic array. 



WO 92/03802    PCT/GB91/01390

It is known from EP-A-0 021 404 to employ an array of specially
designed processors in a computer system for the simulation of logic
operations. These processors operate in parallel. However, this prior 
art parallel array is disadvantageous in that data flow through it 
requires a multi-way switch operated by a computer. For i processors, 
the switch is i-by-i-way so that each processor can be connected to each 
of the others under computer control. This is not compatible with a 
systolic array architecture, in which (a) there is no controlling 
computer, (b) data flow paths in the array are fixed, (c) data flow is 
between nearest neighbours, (d) there are no external control 
instructions, and (e) conventional general purpose processors (eg 
transputers) may be used with programming to execute fairly 
straightforward arithmetic functions. Indeed, a major objective of 
systolic array architectures is to avoid the need for a controlling 
computer. 

US Patent No. 4,622,632 to Tanimoto et al. relates to a pattern 
matching device which employs arrays of processors for operating on 
pyramidal data structures. Here the processors operate under the 
control of what is said to be a "controller", by which is presumably
meant a control computer. The controller provides instructions to each
of the processors in synchrony. The instructions both provide data
store addresses and dictate which of its various processing functions an
individual processor employs. Each processor performs a
read-modify-write cycle in which data in a memory module is written back
out to the same address from which it was obtained. As discussed above
for EP-A-0 021 404, this is not compatible with a systolic array
architecture, in which (a) there is no controlling computer, (b) data 
flow paths in the array are fixed, (c) data flow is between nearest 
neighbours, and (d) there are no external control instructions. 

It is an object of the present invention to provide a digital processor
suitable for simulating operation of a parallel processing array such as
a systolic array.

The present invention provides a digital data processor for simulating
operation of a parallel processing array, the processor including an 
assembly of digital processing devices connected to data storing means, 
characterised in that:- 

(a) each processing device is programmed to implement a respective list 
of sets of storing means data addresses; 

(b) each address set contains input data addresses and output data 
addresses which differ, and each such set corresponds to data 
input/output functions of a respective simulated array cell; 

(c) each list of address sets corresponds to a respective sub-array of 
cells of the simulated array, and each such list contains pairs of 
successive address sets in which the leading address sets have 
input data addresses like to output data addresses of respective 
successive address sets, each list being arranged to provide for 
operations associated with simulated cells to be executed in 
reverse order to that corresponding to data flow through the 
simulated array; and 

(d) each processing device is programmed to employ a respective first 
address set to read input data from and write output data to the 
data storing means, the output data being generated in accordance 
with a computational function, to employ subsequent address sets in 
a like manner until the list is complete, and then to repeat this
procedure cyclically.

The invention provides the advantage that it requires a reduced number
of processing devices compared to a prior art array (such as a systolic
array) which it simulates. The reduction is in proportion to the number
of address sets per list. Each processing device is in effect allocated
the functions of a number or sub-array of simulated array cells, and is
programmed to execute the functions of a number of these cells in
succession and then repeat. The simulated array operation is therefore
carried out, albeit at a reduced rate. However, a degree of parallelism
is preserved because the overall computation is distributed over an
assembly of individual processing devices. In consequence, the
parallelism advantage over a conventional computer is retained. The
invention might be referred to as a semi-parallel processor.

The invention may be arranged so that each processing device
communicates with not more than four other processing devices; it may
then incorporate storing means including register devices and memories
connected between respective pairs of processing devices. The
invention may incorporate storing means arranged to resolve addressing
conflicts; preferably however the address lists are arranged such that
each register device and memory is addressed by not more than one
processing device at a time. Some of the processing devices may be
arranged to communicate with two of the other processing devices via
respective register devices. In this case the address set lists are
arranged such that the register devices are addressed less frequently
than the memories.

Each processing device may be arranged to store and update a respective
coefficient in relation to each address set in its list.

The invention may incorporate processing devices with input means arranged for 
parallel to serial conversion of input data elements. This enables the processor 
to implement simultaneous input as in the systolic array which it simulates. 

In order that the invention might be more fully understood, embodiments thereof
will now be described, by way of example only, with reference to the
accompanying drawings, in which:-

Figures 1, 2 and 3 illustrate the construction and mode of operation of a
prior art systolic array;

Figure 4 is a block diagram of a processor of the invention arranged to
simulate part of the Figure 1 array and incorporating eight processing units;

Figure 5 illustrates the mode of operation of the Figure 4 processor
mapped on to the Figure 1 array;

Figure 6 illustrates read and write functions of a processing unit
incorporated in the Figure 4 processor;

Figure 7 illustrates memory and programming arrangements associated with
individual processing units in the Figure 4 processor;

Figure 8 schematically illustrates memory addressing in the Figure 4
processor;

Figure 9 is a block diagram of an input/output port for a processing unit;

Figures 10 and 11 illustrate the construction and mode of operation of an
alternative embodiment of the invention incorporating an odd number of
processing units; and

Figure 12 illustrates the mode of operation of a further embodiment of the
invention incorporating four processing devices.

Referring to Figure 1, a prior art triangular systolic array 10 is shown
schematically. The array 10 is of the kind disclosed in British Patent No.
2,151,378B (US Pat. No. 4,727,503). It includes a 15 x 15 above-diagonal
sub-array of internal cells indicated by squares 12. A linear chain of fifteen
boundary cells 14 shown as circles forms the triangular array diagonal. Adjacent
boundary cells 14 are interconnected via one-cycle delay cells or latches indicated
by dots 16. A multiplier cell 18 is connected to the lowermost internal and
boundary cells 12 and 14. Each of the cells 12 to 18 is activated by a system
clock (not shown), and the cells 12 to 16 carry out pre-arranged computations
on each clock cycle. Input to the array 10 is from above as indicated by arrows
20. Horizontal outputs from boundary cells 14 pass along array rows as
indicated by intercell arrows 22. Outputs from internal cells 12 pass down array
columns as indicated by vertical intercell arrows 24. Boundary cells 14 have
diagonal inputs and outputs such as 26 and 28 interconnected along the array
diagonal via latches 16.

Referring now also to Figure 2, the processing functions of the internal and
boundary cells 12 and 14 are shown in greater detail. On each clock cycle,
each boundary cell 14 receives an input value $x_{in}$ from above. It employs a
stored coefficient r together with $x_{in}$ to compute cosine and sine rotation
parameters c and s and an updated value of r in accordance with:

$$r' = \sqrt{r^2 + x_{in}^2} \qquad (1)$$

For $x_{in} = 0$, c = 1 and s = 0; otherwise:-

$$c = r/r', \quad s = x_{in}/r' \qquad (2)$$

and

$$r\ (\text{updated}) = r' \qquad (3)$$

The parameters c and s are output horizontally to a neighbouring internal cell 12
to the right.

Each boundary cell 14 also multiplies an upper left diagonal input $\delta_{in}$ by the
parameter c to provide a lower right diagonal output $\delta_{out}$, ie

$$\delta_{out} = c\,\delta_{in} \qquad (4)$$

This provides for cumulative multiplication of c parameters along the array
diagonal.
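The boundary cell function can be sketched in a few lines of Python. This is a minimal illustration of the rotation-parameter computation described above, not code from the patent; the function name `boundary_cell` and the default diagonal input of 1.0 are assumptions made for the example.

```python
import math


def boundary_cell(r, x_in, delta_in=1.0):
    """One clock cycle of a boundary cell.

    Returns (r_updated, c, s, delta_out). The name and the default
    delta_in are illustrative assumptions.
    """
    r_new = math.hypot(r, x_in)      # r' = sqrt(r^2 + x_in^2)
    if x_in == 0:
        c, s = 1.0, 0.0              # special case: no rotation
    else:
        c, s = r / r_new, x_in / r_new
    delta_out = c * delta_in         # cumulative product of c along the diagonal
    return r_new, c, s, delta_out


r_new, c, s, delta_out = boundary_cell(3.0, 4.0)
```

With r = 3 and an input of 4, the stored coefficient becomes 5 and the rotation parameters are c = 0.6 and s = 0.8, so that c and s always satisfy c^2 + s^2 = 1.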

On each clock cycle, each internal cell 12 receives input of c and s parameters
from the left and $x_{in}$ from above. It computes $x_{out}$ and updates its stored
coefficient r in accordance with:-

$$x_{out} = -s\,r + c\,x_{in} \qquad (5)$$

$$r\ (\text{updated}) = c\,r + s\,x_{in} \qquad (6)$$
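As an illustration, the internal cell update is a plane rotation applied to the pair (r, x_in). The sketch below is an assumption-labelled example rather than the patent's own code; the name `internal_cell` is hypothetical.

```python
def internal_cell(r, c, s, x_in):
    """One clock cycle of an internal cell; returns (r_updated, x_out).

    c and s arrive from the left, x_in from above. The function name is
    an illustrative assumption.
    """
    x_out = -s * r + c * x_in    # value passed down the column
    r_new = c * r + s * x_in     # updated stored coefficient
    return r_new, x_out


# Rotation parameters c = 0.6, s = 0.8 applied to r = 0, x_in = 5.
r_new, x_out = internal_cell(0.0, 0.6, 0.8, 5.0)
```

Because c^2 + s^2 = 1, the rotation preserves the sum of squares: r_new^2 + x_out^2 equals r^2 + x_in^2.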

Data input to the array 10 is illustrated schematically in Figure 3, in which the
vertical dimension is shown foreshortened for illustrational convenience. Figure 3
shows a first vector $x_1$ and a first element $y_1$ in the process of input to the
array 10. The vector $x_1$ has fifteen elements $x_{11}$ to $x_{1,15}$, and is the leading
row of a data matrix X. A column vector y is input to the rightmost array
column. The vector y has elements $y_1$, $y_2$, ..., and the nth element $y_n$
appears as an extension of the nth row $x_{n1}$ to $x_{n,15}$ of the data matrix X. As
illustrated, $y_1$ extends $x_1$.

The first element $x_{11}$ of the first input vector $x_1$ is input to the top row
(leftmost) boundary cell 14. Successive elements $x_{12}$, $x_{13}$ etc of $x_1$ are input
to successive top row internal cells 12 with a temporal skew. Temporal skews
are well known in the art of systolic arrays. In the present case the skew is a
delay of one clock cycle between input to adjacent top row cells of elements of
like vectors. The skew increases linearly to the right, so that input of the ith
element $x_{ni}$ of the nth vector $x_n$ to the ith column of the array 10 lags input of
$x_{n1}$ to the first column by (i-1) clock cycles.
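The skewed input timing can be tabulated with a short sketch. It assumes one new data vector per clock cycle and leaves the cycle on which the very first element enters as a parameter (`start`, a hypothetical name); only the (i-1) lag is taken from the text.

```python
def input_cycle(n, i, start=1):
    """Clock cycle on which element i of data vector n enters column i.

    n and i are 1-based. `start` is the assumed cycle on which the first
    element of the first vector enters the leftmost column.
    """
    skew = i - 1                 # lag of column i behind column 1
    return start + (n - 1) + skew


# Entry cycles for the first three vectors across the first four columns.
schedule = [[input_cycle(n, i) for i in range(1, 5)] for n in range(1, 4)]
```

Each row of `schedule` is shifted by one cycle relative to the row above, which is the diagonal wavefront pattern typical of systolic input.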

When $x_{11}$ is input to the uppermost boundary cell 14, it is employed to compute
rotation parameters c, s for transforming the first vector $x_1$ into a rotated vector
having a leading element of zero. On the clock cycle following input of $x_{11}$ to
the uppermost boundary cell 14, $x_{12}$ is input to its row neighbour internal cell
12 in synchronism with input of c, s computed from $x_{11}$. One clock cycle later,
the parameters c, s derived from $x_{11}$ reach the third cell from the left in the
top row and are used to operate on $x_{13}$. In this manner, c, s computed from
$x_{11}$ are employed to operate on elements $x_{12}$ to $x_{1,15}$ and $y_1$ on successive
clock cycles. This produces a rotated version of $x_1$ from which $x_{11}$ is
eliminated, the version passing to the second processor row. A similar procedure
occurs in the second row, ie the rotated version of $x_{12}$ is used to compute c
and s values for operation on the rotated versions of $x_{13}$ to $x_{1,15}$ and $y_1$. This
procedure continues down the processor rows until all x-vector elements have
been eliminated.



Subsequent data vectors $x_2$, $x_3$, ... representing further rows of the data matrix
X are processed in the same way as $x_1$ by input to the uppermost array row.
In general, the ith element $x_{ni}$ of the nth data vector $x_n$ is input to the ith
array column on the (n + i - 1)th clock cycle. Similarly, the nth element $y_n$
of the column vector y is rotated in each row as though it were an additional
element of the nth data vector $x_n$. Each cumulatively rotated version of $y_n$
passes to the multiplier cell 18. Here it is multiplied by the cumulatively
multiplied c rotation parameters derived from $x_n$ and computed along the array
boundary cell diagonal. The output of the multiplier cell 18 is the least squares
residual $e_n$ given by:-

$$e_n = x_n^T w_n + y_n \qquad (7)$$

where $x_n^T$ is the transpose of $x_n$, and $w_n$ is a weight vector computed over
all $x_1$ to $x_n$ to minimise the sum of the squares of $e_1$ to $e_n$.

In more general mathematical terms, the array 10 carries out a QR
decomposition of the data matrix X as described in the prior art; ie the rotation
algorithm operates on X to generate a matrix Q such that:-

$$Q X = \begin{bmatrix} R \\ 0 \end{bmatrix} \qquad (8)$$

where R is an upper right triangular matrix. The matrix elements r of R are
stored on individual internal and boundary cells 12 and 14 in all but the
rightmost array column, and are recomputed every clock cycle. At the end of
computation, the elements r may be extracted from their storage locations and
used to compute the weight vector explicitly.
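The overall effect of the array on one data row can be sketched functionally, ignoring clock timing and skew. The example below is an untimed approximation written for illustration: it applies the boundary and internal cell rotations row by row and forms the residual as the product of the accumulated c parameters and the fully rotated y element. The names `qr_update`, `R` and `u` are assumptions, with `u` standing for the coefficients held in the rightmost column.

```python
import math


def qr_update(R, u, x, y):
    """Pass one data row (x, y) through an untimed model of the array.

    R is the n x n upper triangular coefficient store, u the rightmost
    column store; both are updated in place. Returns the residual.
    """
    x = list(x)
    gamma = 1.0                                   # product of c parameters
    for k in range(len(x)):                       # array row k
        r_new = math.hypot(R[k][k], x[k])         # boundary cell
        if x[k] == 0:
            c, s = 1.0, 0.0
        else:
            c, s = R[k][k] / r_new, x[k] / r_new
        R[k][k] = r_new
        gamma *= c
        for j in range(k + 1, len(x)):            # internal cells of row k
            R[k][j], x[j] = c * R[k][j] + s * x[j], -s * R[k][j] + c * x[j]
        u[k], y = c * u[k] + s * y, -s * u[k] + c * y
    return gamma * y                              # multiplier cell output


R = [[0.0, 0.0], [0.0, 0.0]]
u = [0.0, 0.0]
residuals = [qr_update(R, u, x, y)
             for x, y in [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 0.0)]]
```

For these three rows the least squares weights are (0, -1), so the residual for the third row is 1*0 + 1*(-1) + 0 = -1, which is what the sketch returns without ever forming the weight vector explicitly.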

Figures 1 to 3 exemplify a typical prior art systolic array arranged inter alia to
carry out QR decomposition. The array 10 exhibits the following characteristics
which typify systolic arrays:

(a) nearest-neighbour cell interconnections form rows and columns;

(b) many of the cells (ie internal cells) have like signal processing functions;

(c) each cell performs its processing function on each system clock cycle;
and
(d) signal flow is generally down columns and along rows of the array.

Systolic arrays suffer from the major disadvantage of requiring large numbers of
processing cells, such as internal cells 12 in particular. To perform a QR
decomposition on the data matrix X and associated residual extraction involving
the vector y, the array 10 employs a linear chain of fifteen boundary cells 14
and a triangular sub-array of one hundred and twenty internal cells 12. The
internal cells 12 form a 15 x 15 sub-array, and the array 10 as a whole is a 16
x 16 array. This arises from the fifteen-dimensional nature of the data matrix
X and the one-dimensional nature of each element of the vector y. Generally,
the number of cells required in a systolic array grows as the square of the
number of dimensions of the computation to be performed. In a version of the
array 10 appropriate for an n-dimensional data matrix X, n(n + 1)/2 internal
cells 12 would be required. Each cell is of the order of complexity of a
microprocessor having floating point arithmetic capability, and requires the ability
of a transputer to communicate with up to four neighbours. For computations
where n is in the order of 100 or greater, the number of cells is of order $10^4$
or more. The cost and bulk of such an array is therefore unacceptably large for
many purposes.
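The growth in cell count is easy to check numerically. The following sketch simply evaluates the n(n + 1)/2 expression quoted above; the function name `cell_counts` is an illustrative assumption.

```python
def cell_counts(n):
    """Return (boundary_cells, internal_cells) for an n-dimensional data matrix."""
    boundary = n                    # one boundary cell per dimension
    internal = n * (n + 1) // 2     # triangular sub-array including the y column
    return boundary, internal


counts_15 = cell_counts(15)
counts_1000 = cell_counts(1000)
```

For n = 15 this gives 15 boundary cells and 120 internal cells, matching the array 10; for n = 1000 the internal cell count is 500500, ie roughly half of $n^2$.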

Referring now to Figure 4, there is shown a processor 40 of the invention. The
processor 40 incorporates eight processing units P1 to P8 with respective
associated two-port memories M1 to M8. The unit P1 is also associated with a
two-port memory M0. The units P1 to P8 are connected to respective decoders
D1 to D8 and input/output ports I/O1 to I/O8. The input/output ports I/O1 to
I/O8 are shown in simplified form to reduce illustrational complexity, but will be
described in more detail later. Each is arranged to accept up to four digital
words simultaneously in parallel, and to transfer them serially to a corresponding
processing unit P1 to P8. They also provide for serial word output.

The ith processing unit $P_i$ (i = 1 to 8) is associated with a respective data bus
$B_i$ and memory address bus $A_i$. The ith address bus $A_i$ connects processing unit
$P_i$ to memories $M_i$ and $M_{i-1}$. Each of the input-output ports I/O1 to I/O8
has complex read/write and data input/output connections (not shown) to external
circuitry. These will be illustrated in detail later. In Figure 4, they are
indicated schematically by respective buses $41_1$ to $41_8$. The ith data bus $B_i$
connects processing unit $P_i$ to memories $M_i$ and $M_{i-1}$, to port $I/O_i$ and to a
block of word registers indicated generally by 42.

The register block 42 incorporates three sections $42_1$ to $42_3$ each of four
registers, R11 to R34 in all, the ith section $42_i$ (i = 1 to 3) consisting of
registers $R_{i1}$ to $R_{i4}$. The block 42 also includes a fourth section $42_4$
consisting of one register R41. Each register $R_{ij}$ is shown with a single or
double arrow indicating its input side (left or right) and the number of digital
words stored; ie single and double arrows correspond to one and two stored words
respectively. Each register is a first in, first out (FIFO) device. Registers R11,
R21, R31 and R41 are one word devices receiving input from the left and
providing output to the right. Register inputs and outputs are unreferenced to
reduce illustrational complexity. Registers R12, R22 and R32 are also one word
devices, but input from the right and output to the left. Registers R13, R14,
R23, R24, R33 and R34 are two word devices which input from the left and
output to the right.

The ith section of registers $42_i$ (i = 1 to 4) is connected to data bus $B_{9-i}$ to its
left, each register having a respective bus branch connection. The upper three
registers (eg R32 to R34) of the ith section $42_i$ (i = 1 to 3) are connected to
data bus $B_{i+1}$ (eg B4) to their right. However, the lowermost register $R_{i1}$ in
the ith section $42_i$ (i = 1 to 4) is connected to data bus $B_i$.

The processing units P1 to P8 have respective read-write output lines R/W1 to
R/W8 connected to ports I/O1 to I/O8, associated memories M0-M1 to M7-M8
and registers such as R12 and R21. The lines R/W1 etc are each two bits wide
as indicated by /2. The units P1 to P8 are also connected to their respective
decoders D1 to D8 by three-bit chip address lines C1 to C8 marked /3.

Each of the decoders D1 to D8 has seven one-bit output lines such as D1 lines
44 for example, and these lines are connected to respective memory, I/O port
and register devices M1, I/O1, R11 etc. Some decoder lines such as those at 46
of D5 are surplus to requirements. These are left unconnected as indicated by
X symbols. X symbols also indicate unconnected buses below memories M0 and
M8.

The mode of operation of the processor 40 as compared to that of the prior art
device 10 is illustrated in Figure 5. In this drawing, conceptual locations of
internal cells 12 in the device 10 are indicated as rectangles such as 50. The
scale of the drawing is vertically foreshortened for illustrational convenience.
Each of the processing units P1 to P8 executes the computational tasks of a
respective fifteen internal cells 12. In accordance with this, each rectangle 50
incorporates within it a number indicating the associated processing unit; ie
rectangles 50 (and the internal cells 12 they represent) referenced internally with
the numeral i (i = 1, 2, ... 7 or 8) are associated with processing unit $P_i$.
Each rectangle also has external upper left and lower right indices V1 and V2
respectively, where V1 is in the range 1 to 15 and V2 = V1 + 15 in each case.
V1 and V2 respectively correspond to the first and second intervals in time at
which the relevant processor in each case carries out the function of the internal
cell associated with the location. The drawing also includes diagonal lines
representing memories M0 to M8. Dotted lines 52 link different regions
associated with respective common memories. Locations representing register
sections are indicated by multi-cornered lines with like references $42_1$ to $42_4$.

In operation, each processing unit $P_i$ executes in sequence processing tasks which
would be carried out by a respective fifteen internal cells 12 in the prior art. A
cycle of operation of the prior art systolic array 10 therefore requires fifteen
cycles of the processor 40 of the invention. The latter will be referred to as
subcycles. Subcycles 1 to 15 consequently correspond to cycle 1, subcycles 16 to
30 to cycle 2 and so on. Numerals V1 and V2 in Figure 5 are subcycle
numbers. On subcycles or V1 values 1 to 15, processing unit P1 executes the
processing functions of internal cells located in the lower sections of the two
lowest diagonals of the array 10, as indicated by numeral 1 within corresponding
rectangles 50 in Figure 5. Unit P1 begins on subcycle 1 with a computation
corresponding to the function of that internal cell 12 in the centre of the lowest
diagonal of the array of cells, as indicated by an upper left V1 value of 1. On
subcycle 2, as indicated by V1 = 2, unit P1 carries out a first cycle computation
corresponding to the lowermost internal cell 12 in the final (rightmost) column.
On subcycle 3, the computation is that of the internal cell 12 in the penultimate
(second lowest) row and final column. This procedure is repeated on successive
subcycles, the conceptual location of the processing operation reducing by one in
row or column position alternately. After subcycle 15, ie after the end of cycle
1, the computation executed is that of the lowest internal cell at the centre of
the lowest diagonal once more, as indicated by V2 = 16, and thereafter the
sequence repeats to implement cycle 2.
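The relationship between subcycles, simulated array cycles and positions in a processing unit's list is simple modular arithmetic. The sketch below is illustrative only; `CELLS_PER_UNIT` and `locate` are hypothetical names, and the value 15 is the number of internal cells allocated to each unit in this embodiment.

```python
CELLS_PER_UNIT = 15   # internal cells simulated by each processing unit


def locate(subcycle):
    """Map a 1-based subcycle number to (array cycle, position in the unit's list)."""
    cycle, position = divmod(subcycle - 1, CELLS_PER_UNIT)
    return cycle + 1, position + 1


first = locate(1)      # first subcycle of cycle 1
repeat = locate(16)    # V2 = V1 + 15 revisits the same list position
```

Subcycles 1 and 16 map to the same list position in cycles 1 and 2 respectively, which is why each rectangle in Figure 5 carries the pair of indices V1 and V2 = V1 + 15.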

Similar processing sequences are executed by processing units P2 to P8. Units
P2, P3 and P4 carry out computations corresponding to respective pairs of part
minor diagonals. The equivalents for units P5, P6 and P7 are two respective
complete minor diagonals together with two respective part minor diagonals. For
unit P8, there is a single upper right location and the upper parts of diagonals
having lower parts associated with P1.

Each of the units P1 to P8 reads from and writes to respective memories among
M0 to M8 and register sections $42_1$ to $42_4$. Memories and registers are
illustrated in Figure 5 adjacent the conceptual internal cell locations to which
they are interfaced. For example, throughout each cycle processing unit P1
communicates with memories M0 and M1, but also communicates with register
section $42_1$ on subcycle 1 (ie one subcycle per cycle). Unit P2 communicates
with register section $42_1$ on subcycle 1 and both register sections $42_1$ and $42_2$
on subcycle 3.

The mode of operation of the processor 40 of the invention will now be
described in more detail with reference to Table 1 and Figures 6 to 8. Parts in
Figures 6 to 8 and Table 1 which were described earlier are like referenced.
Figure 2 illustrated each internal cell 12 receiving input of three quantities c, s
and $x_{in}$, performing a computation and generating outputs c, s and $x_{out}$. This
is re-expressed in Figure 6 as three read operations RE1 to RE3 and three write
operations WR1 to WR3. In Figure 7, the nth processing unit $P_n$ (n = 1, 2, ...
7 or 8) is shown connected between memories $M_{n-1}$ and $M_n$. It incorporates
processing logic responsive to a stored program in local (ie internal) memory
which also contains a data address look-up table and a coefficient store. The
look-up table is a list of fifteen address sets, ie one set per subcycle. The
coefficient store has space for fifteen updatable coefficients of the kind r, and for
temporary storage of a value for which an output delay is required. In Figure
8, the lower right hand region of Figure 5 is shown on an expanded scale.
Memories M0 to M3 are shown subdivided into individual address locations
labelled with integers. Not all address locations illustrated are employed. As in
Figure 5, in Figure 8 each processing unit P1 to P3 (indicated by the relevant
numerals within boxes) has an upper left numeral indicating subcycle number. In
Table 1, addresses in memories M0 to M3 are given for read and write
operations in processing units P1 to P3 on subcycles 6 and 7. Addresses shown
in Figure 8 and Table 1 are in the range 0 to 22 for illustrational convenience,
although in practice a typical memory address space would be 256 (8 bits) or
greater.

As has been said, processing begins on subcycle 1, the first subcycle of the first
cycle. However, it turns out that the first subcycle of each cycle is in fact a
special case. In consequence, read/write operations on subcycles 2 onwards will
first be described as being typical, and those of subcycle 1 will be discussed
later.

Processing unit P1 operates as follows. Referring to Figures 4, 7 and 8 once
more, for each subcycle the stored programme in local memory has three
successive read instructions: each of these requires data to be read from three
data addresses of a respective address set stored in the local memory look-up
table and corresponding to the current subcycle of operation. The look-up table
also stores values for the chip address lines C1, which are equivalent to a
three-bit extension of the address bus A1. In Table 1, $M_nZ$ designates address
Z in memory $M_n$. On subcycle 2, the read operations RE1, RE2 and RE3 for
processing unit P1 are from addresses $M_1 0$, $M_0 8$ and $M_0 7$ respectively. The unit
P1 places an address on address bus A1 corresponding to Z = 0, and places a
three-bit code on chip address lines C1 providing for M1 to be enabled by
decoder D1 and for M0, R11 and I/O1 to be disabled. It also places a two-bit
"read" code on read/write line pair R/W1 to signify a read operation. This
causes memory M1 to place the contents of its address 0 on the data bus B1,
where it is read by processing unit P1 as RE1 and temporarily stored. Unit P1
then changes the code output at C1 to that required for decoder D1 to enable
M0, and changes the address on bus A1 to Z = 8 and subsequently Z = 7.
This provides for successive read operations RE2 and RE3 from addresses 8 and
7 of M0.

Having carried out three read operations in succession on subcycle 2, unit P1
executes the internal cell computations shown in Figure 2 to generate $x_{out}$ and r
(updated) for the second conceptual internal cell with which it is associated. It
replaces its internally stored value of the second of fifteen coefficients r (initially
zero) by r (updated), and is then ready to output the newly computed $x_{out}$
together with two unchanged input values (c, s input as RE2, RE3). On
subcycle 2, unit P1 performs the function of the lowermost internal cell 12 of
Figure 1, which provides c, s and $x_{out}$ signals to destinations outside the internal
cell sub-array. In the processor 40, this situation is implemented by write
operations to an input/output port. The processing unit P1 consequently executes
three successive write operations to port I/O1. It obtains from its look-up table
the next three chip address codes. These are in fact the same code, that
required to access port I/O1 and for which no address on bus A1 is needed.
They form the second half of the first address set. Unit P1 places on chip
address lines C1 the chip address code obtained from the look-up table. This
activates decoder D1 to enable port I/O1, and unit P1 subsequently places a
two-bit "write" code on line pair R/W1 and places values $x_{out}$, c and s in
succession on data bus B1 as WR1, WR2 and WR3 respectively. This routes the
values to subsequent signal processing circuitry (not shown) interfaced to port
I/O1.

Subcycle 2 ends when WR3 has been output, and processing unit P₁ proceeds to
implement the subcycle 3 functions. These require reading from M₁8, M₀10 and
M₀9, and writing to M₁0 (WR1) and I/O₁ (WR2 and WR3), which form the third
address set of unit P₁. The WR1 function overwrites the contents of M₁0 read on
the preceding subcycle. Unit P₁ also computes and internally stores an updated
R-matrix element r appropriate to its third associated internal cell location.
On later subcycles, as shown in Figure 8, the read and write operations are to
and from memory addresses in M₀ and M₁. Table 1 gives the read and write memory
addresses in M₀ to M₃ and port I/O₃ for processing units P₁ to P₃ on subcycles
6 and 7.
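
The address-set mechanism described above can be sketched in a few lines of Python. This is an illustration only, not the stored programme of the processing units: the dictionary layout, the function names and the device labels "M0", "M1" and "IO1" are hypothetical, and the internal cell arithmetic is assumed to be a standard Givens-rotation update because Figure 2 is not reproduced in this passage. The two address sets are those given in the text for unit P₁ on subcycles 2 and 3.

```python
# Each address set holds three read targets (RE1, RE2, RE3) followed by three
# write targets (WR1, WR2, WR3). A target is (device, address); the address is
# None for an input/output port, which needs no address on the address bus.
SUBCYCLE_2 = {
    "reads": [("M1", 0), ("M0", 8), ("M0", 7)],
    "writes": [("IO1", None), ("IO1", None), ("IO1", None)],
}
SUBCYCLE_3 = {
    "reads": [("M1", 8), ("M0", 10), ("M0", 9)],
    "writes": [("M1", 0), ("IO1", None), ("IO1", None)],
}


def internal_cell(x_in, c, s, r):
    """Assumed internal cell function (a Givens rotation applied to x_in, r)."""
    x_out = c * x_in - s * r
    r_updated = s * x_in + c * r
    return x_out, r_updated


def run_subcycle(address_set, memories, port_log, r):
    """Read x_in, c, s; update the coefficient r; write x_out, c, s."""
    x_in, c, s = (memories[dev].get(addr, 0.0) for dev, addr in address_set["reads"])
    x_out, r_updated = internal_cell(x_in, c, s, r)
    for (dev, addr), value in zip(address_set["writes"], (x_out, c, s)):
        if addr is None:
            port_log.append((dev, value))   # output via the port
        else:
            memories[dev][addr] = value     # output via a shared memory
    return r_updated


# Made-up memory contents, purely to exercise the two address sets.
memories = {"M0": {8: 0.0, 7: 1.0, 10: 1.0, 9: 0.0}, "M1": {0: 3.0, 8: 5.0}}
port_log = []
r_second_cell = run_subcycle(SUBCYCLE_2, memories, port_log, r=0.0)
r_third_cell = run_subcycle(SUBCYCLE_3, memories, port_log, r=0.0)
```

Note how the WR1 of the subcycle 3 set overwrites address 0 of M₁ only after the subcycle 2 set has read it, which is the ordering the text relies on.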

Processing unit P₁ reads from memories M₀/M₁ and writes to those memories
and/or port I/O₁ exclusively during subcycles 2 to 15. On subcycle 1 however,
as indicated in Figure 5, it is interfaced with register section 42₁
immediately above. As shown in Figure 6, RE1 is read from above. It is
therefore received from the register of register section 42₁ in response to an
enable signal from decoder D₁. This register receives input from the eighth
processing unit P₈ on later cycles.

Subcycle 1 is a special case in the operation of processing unit P₁. So also
are subcycles 16, 31 etc, ie the first subcycle of each cycle and numbered
(15(n-1) + 1), n = 1, 2, 3 etc. These are also special cases for the other
processing units P₂ to P₈. The reason is as follows. In the simulated systolic
array 10, data and result flow is downwards and to the right. It progresses at
the rate of one cell per clock cycle along rows and down columns. An internal
cell having a neighbour to its left or above receives data from the neighbour
which the neighbour used or computed one cycle earlier. In the processor 40
however, as shown in Figure 5, a processing unit (P₁ etc) proceeds conceptually
upwards and to the left on successive subcycles in the reverse of the systolic
array data flow direction. In consequence of this, inputs to a processing unit
P₁ etc from a neighbouring location are not generated one cycle earlier, but
instead one cycle minus one subcycle earlier. For most of each cycle this
difference is immaterial. However, on subcycle 1 (and later equivalents) the
right hand neighbouring location corresponds to subcycle 15; ie these two
subcycles are the beginning and end of the same first cycle. The right hand
location (V1 = 15, V2 = 30) is fourteen subcycles behind the left hand location
(V1 = 1, V2 = 16) in this special case, instead of being one subcycle ahead as
elsewhere in the cycle. In consequence, in the absence of arrangements to the
contrary, right hand outputs (values c, s output as WR2, WR3) from processing
unit P₁ on subcycle 1 of the first cycle would be used as inputs on subcycle 15
of the first cycle. Similarly, the vertical output (x_out = WR1) from
processing unit P₁ on subcycle 1 to memory M₀ would occur too early. This would
conflict with the systolic array processing requirement that a result generated
by a processing cell on one cycle is to be employed by a neighbour to its right
or below on the succeeding cycle. Similar remarks apply to all other processing
units P₂ to P₈.

To deal with this timing problem, on the first subcycle of each cycle, ie
subcycle (15(n-1) + 1), n = 1, 2, 3 etc, the processing units P₁ to P₈ store
internally their current values of x_out, c and s. They each output as WR1,
WR2 and WR3 the respective values of x_out, c and s which they stored on the
preceding cycle (if any). In consequence, on (and only on) the first subcycle
of each cycle, outputs from the processing units P₁ to P₈ are delayed by one
cycle. This involves an additional three storage locations in each processing
unit's internal coefficient store.
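
The one-cycle delay applied on the first subcycle of each cycle can be illustrated as follows. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, and `None` is used here to stand for "nothing stored yet" on the very first cycle, a case the text leaves open with the words "if any".

```python
class DelayedOutput:
    """Holds (x_out, c, s) for one cycle on first subcycles only."""

    def __init__(self, subcycles_per_cycle=15):
        self.subcycles_per_cycle = subcycles_per_cycle
        self.held = None  # the three extra storage locations

    def emit(self, subcycle, x_out, c, s):
        """Return the values actually written as WR1, WR2, WR3."""
        is_first = (subcycle - 1) % self.subcycles_per_cycle == 0
        if not is_first:
            return (x_out, c, s)            # ordinary subcycle: no delay
        previous, self.held = self.held, (x_out, c, s)
        return previous                     # values stored one cycle earlier


delay = DelayedOutput()
first = delay.emit(1, 1.0, 2.0, 3.0)        # subcycle 1: nothing stored yet
ordinary = delay.emit(2, 4.0, 5.0, 6.0)     # subcycle 2: passed straight through
second = delay.emit(16, 7.0, 8.0, 9.0)      # subcycle 16: outputs subcycle 1 values
```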

At the end of subcycle 15, the first cycle of operations of unit P₁ is complete
and subcycle 16 begins the second cycle (V2 = 16 to 30). As shown in Figure 5,
processing unit P₁ reverts to execution of the computation of the internal cell
12 at the centre of the lowermost diagonal. On cycle 2 (subcycles 16 to 30) the
unit P₁ reads in data (if any) stored in memories M₀ and M₁ and the register
during cycle 1. It temporarily stores three data values to implement a one
cycle delay. It also stores fifteen values of r in the process of updating,
each r corresponding to a respective prior art internal cell.

Similar remarks apply to other processing units P₂ to P₈ and to later cycles.
In general, a processing unit Pₙ reads from and writes to its associated
memories Mₙ₋₁ and Mₙ (n = 1 to 8) for most of a cycle. Exceptions to this are
as follows. Units P₅ to P₈ execute RE1 (x_in) from respective ports I/O₅ to
I/O₈ when performing computations corresponding to internal cells 12 in the
uppermost row of the Figure 1 prior art array 10. Units P₅, P₆ and P₇ are in
this situation on four subcycles of each cycle (eg unit P₅ on subcycles 6 to 9
of cycle 1), but unit P₈ only three. This "uppermost" RE1 operation is
equivalent to input of an element of a data matrix X (see Figure 3) to an array
10. All eight processing units P₁ to P₈ execute processing functions
corresponding to internal cells in extreme right hand column locations at
respective points in each cycle. Units P₁ to P₇ are in this situation for two
subcycles per cycle, whereas the equivalent for unit P₈ is only one subcycle.
When in this situation, the units P₁ to P₈ execute WR2 and WR3 to respective
ports I/O₁ to I/O₈; P₁ executes WR1 to I/O₁ also for one of its two subcycles
in this situation as previously described. This extreme right hand output
function corresponds to output from the prior art internal cell sub-array of
Figure 1. Finally, a processing unit Pᵢ (i = 1 to 8) reads from or writes to
units P₁₀₋ᵢ and/or P₉₋ᵢ via the intervening register block 42. Each register
(eg R₁₃) is a one or two word temporary storage device arranged on a first in,
first out (FIFO) basis, as has been said. The registers R₁₁ to R₄₁ provide for
communication between processing units which either do not share a common
memory or require additional storage to avoid simultaneous memory addressing by
two units. For example, as shown in Figure 8, on each of subcycles 3 and 5
processing unit P₃ performs two read operations RE2 and RE3 from unit P₇ via
registers R₂₃ and R₂₄ of register block 42₂. On subcycle 3, unit P₃ reads the
first stored words in registers R₂₃ and R₂₄, and on subcycle 5 it reads the
succeeding words stored therein. It also reads the contents of R₂₁ as RE1 and
writes to R₂₂ as WR1 on subcycle 5. Other read/write operations are to and
from memories M₂ and M₃. Similar remarks apply to other pairs of processing
units interfaced together via the register block 42.



The processing units P₁ to P₈ operate in synchronism under the control of an
external clock (not shown). This is similar to prior art systolic arrays and
will not be described. As illustrated and described with reference to Figures
5, 6 and 8, the phasing of the read and write operations of the processing
units P₁ to P₈ is arranged to ensure that each of the memories M₀ to M₈ is
required to respond only to a single address input at any time. For example, on
subcycle 5 in Figure 8, units P₁ and P₂ carry out read-write operations to
memories M₀/M₁ and M₁/M₂ respectively, which could cause a clash in access to
M₁. However, unit P₁ begins (RE1) by reading from M₁ when unit P₂ is reading
from M₂. Subsequently, the P₁ RE2 and RE3 operations are both from M₀, at which
time P₂ has switched to addressing M₁. This phasing of read operations avoids
memory address conflict. Similar remarks apply to write operations and
processing units P₃ to P₈ and memories M₃ to M₈. A read operation is at the
beginning of any subcycle and a write operation is at the end. A memory (eg M₁)
may consequently experience read and write operations on the same subcycle
without conflict of addresses on an address bus (eg A₁); however, in general
two simultaneous operations involving a single memory must be avoided. It is of
course possible to accommodate such conflict at the expense of duplication of
address buses and memories.
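
The phasing argument for subcycle 5 can be checked mechanically. The sketch below is an assumption-laden illustration: it models only the three read phases (RE1, RE2, RE3) of units P₁ and P₂ as stated in the text, and the schedule layout and function name are hypothetical.

```python
from collections import Counter

# Memory addressed by each unit in read phases RE1, RE2, RE3 of subcycle 5,
# as described in the text for units P1 and P2.
READS_SUBCYCLE_5 = {
    "P1": ["M1", "M0", "M0"],
    "P2": ["M2", "M1", "M1"],
}


def find_conflicts(schedule):
    """Return (phase, memory) pairs where two or more units address one memory."""
    n_phases = len(next(iter(schedule.values())))
    clashes = []
    for phase in range(n_phases):
        counts = Counter(targets[phase] for targets in schedule.values())
        clashes.extend((phase, mem) for mem, k in counts.items() if k > 1)
    return clashes


conflicts = find_conflicts(READS_SUBCYCLE_5)

# A deliberately bad schedule: both units address M1 in the first phase.
bad = find_conflicts({"P1": ["M1", "M0", "M0"], "P2": ["M1", "M1", "M2"]})
```

With the phasing given in the text no clash is reported, whereas letting P₂ start on M₁ at the same time as P₁ produces one.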

Referring now also to Figure 9, in which parts previously described are
like-referenced, the structure of each of the input/output ports I/O₁ to I/O₈
is shown in more detail. Subscript indices to references (eg 1 in I/O₁) are
omitted to indicate all parts of the relevant type are referred to. The port
I/O incorporates a four-word parallel-in/serial-out input register 60, together
with a one-word parallel-in/parallel-out register 62. The input register 60 has
four data input buses such as 64, and four write control inputs such as 66
connected to a common write line 68. The output register 62 has an output bus
70 and an associated read output line 72. The read/write line pair R/W of
Figure 4 incorporates a read line 74 connected to the input register 60 and a
write line 76 connected to the output register 62. The two-way data bus B is
connected to both registers 60 and 62. The connections 64 to 72 inclusive were
indicated collectively by bus 41 in Figure 4.

The port I/O operates as follows. Immediately prior to the first subcycle of
each cycle of operation of the processor 40, the write line 68 of the input
register 60 is pulsed, and four digital data words are applied simultaneously
to respective register inputs 64. This overwrites existing register contents
and loads the four words in the register 60 in a successively disposed manner.
Each time the read line 74 is pulsed, the word associated with the right hand
input 64 is placed on the data bus B, and the remaining words are shifted to
the right. This provides for the four loaded words to be output on the data bus
B one after the other in response to four successive read line pulses.
Referring now also to Figure 5 once more, it can be seen that processing unit
P₆ requires to read data from I/O₆ when it is performing top row computations
on subcycles (V2 values) 19, 20, 25 and 26. On each of these subcycles, the
unit P₆ will send out a respective read pulse on line pair R/W₆, and requires a
respective digital word to be placed on data bus B₆ by its input register 60
consisting of the correct matrix element x_ij of the data matrix X previously
referred to. Unit P₆ deals with the 5th, 6th, 11th and 12th top row cell
locations. Matrix elements of the kind x_{n,6}, x_{n-1,7}, x_{n-6,12} and
x_{n-7,13} are therefore simultaneously input to the register 60 of unit P₆.
Here n is a positive integer, and n-k less than or equal to zero is interpreted
as x_{n-k,q} equal to zero for all q. Input is at the end of the last
(fifteenth) subcycle of each cycle as has been indicated. This ensures that the
data is present to be read in over the next cycle by different processing units
executing top row computations at different times. The processing unit P₆ reads
in data words in reverse order (ie x_{n-7,13} leading) on two successive
subcycles followed by a four subcycle gap then two further subcycles.
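
The behaviour of the four-word parallel-in/serial-out input register 60 can be modelled briefly. This is a behavioural sketch only, not a description of the circuit: the class and method names are hypothetical, and the behaviour on a fifth read pulse (here a Python `IndexError`) is an assumption, since the text does not say what happens when the register is empty.

```python
class InputRegister60:
    """Behavioural model of a four-word parallel-in/serial-out register."""

    def __init__(self):
        self.words = []

    def load(self, words):
        """Write line 68 pulsed: four words loaded at once, old contents lost."""
        if len(words) != 4:
            raise ValueError("register 60 takes exactly four words")
        self.words = list(words)

    def read(self):
        """Read line 74 pulsed: right hand word goes to the data bus and the
        remaining words shift one place to the right."""
        return self.words.pop()


# Words listed left to right as applied to the four inputs 64.
register = InputRegister60()
register.load(["x[n,6]", "x[n-1,7]", "x[n-6,12]", "x[n-7,13]"])
order_read = [register.read() for _ in range(4)]
```

The right hand word emerges first, so the list comes out reversed relative to its loading order.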

Similar remarks apply to input to processing units P₅, P₇ and P₈. Unit P₅
requires input from port I/O₅ on four successive subcycles, whereas three
successive inputs from port I/O₈ suffice for unit P₈. Unit P₇ requires input
from port I/O₇ on four subcycles of which the first and last pairs are
separated by eight subcycles.

In practice, the input registers 60 of the processing units P₅ to P₈ are
arranged and loaded in parallel; they receive data simultaneously once per
cycle. This occurs immediately prior to processing unit P₈ computing the
function of the uppermost and extreme right location (V1 = 1, V2 = 16) in
Figure 5. It simulates the prior art systolic array 10, which receives top row
inputs simultaneously. The contents of the registers 60 are overwritten by each
successive input. As will be described later in more detail, meaningful data
(ie x_12) is first processed by unit P₈ on subcycle 30, the data having been
input at 60 prior to subcycle 16. Thereafter the data remains in the registers
60 until overwritten at the end of subcycle 30.

Output from the processor 40 via a port I/O of the kind shown in Figure 9 is
comparatively simple. A write pulse on the line 76 clocks the contents of data
bus B into the output register 62. The read output line 72 is pulsed by
external circuitry (not shown) to read the register contents on to the output
bus 70. A succeeding write pulse at 76 provides for the register contents to be
overwritten. External circuitry (not shown) is arranged to read from the output
register 62 up to five times per cycle, this being the maximum number of output
values per cycle from a single port I/O. In the present example of the
invention, processing units P₁ to P₄ only require output facilities such as
output register 62. However, it is convenient to treat all units P₁ to P₈ as
having like I/O ports.

Referring to Figures 1 and 5 once more, it is useful to compare the operation
of the prior art device 10 with that of the processor 40 of the invention. The
device 10 employs signal flow progressing generally downwards and to the right,
each of the cells 12 to 18 being clock activated and operating on every clock
cycle in phase with one another. In the processor 40 of the invention, this
scheme is at least conceptually preserved from cycle to cycle. Each processing
unit P₁ etc shown in Figure 5 receives data from the equivalent of above and to
the left and outputs to the equivalent of below and to the right. In the case
of processing unit P₁ on subcycle 1 (V1 = 1), it receives from register section
42₁ "above" and memory M₀ "to the left". It subsequently provides (internally
delayed) outputs "below" and "to the right" via memory M₀, the outputs to the
right being for use on the next cycle (subcycle 16 = V2). However, within a
cycle, each of the processing units P₁ to P₈ deals with the conceptual internal
cell locations allocated to it in reverse order compared to prior art data
flow. Thus the first locations to be processed are those lying on an upper
right to lower left diagonal (V1 = 1). Locations are processed in succession
upwardly and to the left; eg processing unit P₁ executes computations
corresponding to internal cell locations in which the row and column numbers
reduce alternately by unity between successive subcycles. For units P₅ to P₈, a
discontinuous shift occurs after top row subcycles. On subcycle 1, the
computations of units P₁ to P₈ correspond to internal cell locations on an
array diagonal extending to the upper right hand corner. Unit Pᵢ on subcycle 15
is processing the internal cell location at row (9-i), column (8+i) (i = 1 to
7) in Figure 5. (For comparison with Figure 1, the column number should be
increased by 1 to allow for the extra column incorporating the uppermost
boundary cell 14.) On subcycle 1, the equivalent for i = 1 to 8 is column (7+i)
with unchanged row number (9-i).

The reason for the conceptual reversing of the order of processing internal
cell locations as indicated in Figure 5 is to ensure that intermediate computed
values stored in memories or registers are not overwritten before they are
needed. For example, referring to Figure 8 once more, on subcycle 3 processing
unit P₁ overwrites the contents of address M₁0 which it read on the previous
subcycle. The new value written to address M₁0 remains there to be read and
then overwritten on the subsequent cycle fourteen subcycles later. If this
procedure were to be reversed, the contents of address M₁0 would be overwritten
before being read during a cycle. In this connection it is emphasised that each
of the processing units P₁ to P₈ employs inputs generated on the preceding
cycle and required to be unaffected by intervening computations. To avoid
unwanted overwriting of stored data without the aforesaid order reversal, it
would be necessary to provide storage (double buffering) and address and data
buses additional to that shown in Figure 4.

The conceptual reversal of the location processing order and the relative
phasing of the operations of the processing units P₁ to P₈ is implemented by
the respective list of data addresses in each processing unit as illustrated in
Figure 7. The addresses in any list are accessed in succession, and when a
processing unit reaches the end of its list it begins again. The relative
phasing illustrated in Figure 5 is implemented by assigning the processing
units P₁ to P₈ appropriate start points for their respective address lists.

The foregoing analysis relating to Figures 4 to 9 has not referred to the
matter of processor start-up. It was assumed implicitly that, from V1 value or
subcycle 1 onwards, the processor 40 was processing data. In the prior art, as
shown in Figures 1 to 3, it takes 15 cycles after input of x_11 to the topmost
boundary cell 14 for y_1 to be input on cycle 16 to the internal cell 12 in the
upper right corner. A further fourteen cycles are required for a cumulatively
processed result arising inter alia from this input to reach the lowermost
internal cell 12. The start-up phase for a prior art systolic array 10
consequently passes as a wavefront from upper left down to lower right, the
wavefront extending orthogonally to its propagation direction. An equivalent
start-up phase occurs in the processor 40 of the invention. The first
processing unit to operate on meaningful input data is P₈ on subcycle 30 at the
top left hand corner of Figure 5. Subcycle 30 is at the end of the second
cycle, during which x_12 is to be input to processing unit P₈. On this
subcycle, unit P₈ is carrying out the processing task of the first (leftmost)
top row internal cell 12 shown in Figure 1, which receives successive matrix
elements of the kind x_n2 (n = 1, 2...). On subcycles 44 and 45, which are in
the third cycle (not shown), unit P₈ reads in x_13 and x_22 respectively to
carry out the functions of the second and first top row internal cells 12 of
Figure 1 respectively. This start-up phase proceeds along rows and down columns
in the Figure 5 representation. Eventually, on subcycle 437, the second
subcycle of cycle 30, processing unit P₁ receives inputs derived from x_11 to
x_{1,15} and y_1. It computes a result corresponding to the first meaningful
output from the lowermost internal cell 12 in the Figure 1 processor 10. The
start-up phase is then complete. Start-up phases are well understood in the
fields of systolic arrays and digital electronics and will not be described
further. It may be desirable to make provision for ignoring or inhibiting those
outputs from the processor 40 which do not correspond to real inputs.
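
The subcycle numbering used in this start-up discussion follows directly from fifteen subcycles per cycle. The helper functions below are hypothetical names introduced only to check the arithmetic quoted in the text; they are not part of the processor.

```python
def locate(subcycle, per_cycle=15):
    """Map an absolute subcycle number to (cycle, subcycle within cycle)."""
    cycle, offset = divmod(subcycle - 1, per_cycle)
    return cycle + 1, offset + 1


def first_subcycle(n, per_cycle=15):
    """Absolute number of the first subcycle of cycle n: 15(n-1) + 1."""
    return per_cycle * (n - 1) + 1


end_of_second_cycle = locate(30)     # (2, 15)
third_cycle_reads = [locate(44), locate(45)]
first_meaningful_output = locate(437)
```

Subcycle 437 resolves to the second subcycle of cycle 30, in agreement with the text, and subcycles 44 and 45 are the last two subcycles of the third cycle.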


The processor 40 requires an equivalent of the chain of boundary cells 14 and
delay latches 16 in order to operate on a data matrix X. It is necessary to
compute parameters c and s from values output by processing units P₁ and P₈,
and temporarily stored in memories M₀ and M₈ for use one cycle later in each
case. This is exemplified in Figure 8 for example on subcycle 6. On this
subcycle, unit P₁ executes WR1 to M₀17; ie address 17 in memory M₀ receives the
equivalent of a vertical output of an internal cell 12 destined for a boundary
cell 14. The memory M₀ is therefore required to be interfaced to a device which
will access M₀17, compute c and s rotation parameters as shown in Figure 2, and
write c and s to M₀14 and M₀13 respectively for use on the next cycle. This is
to be carried out on alternate subcycles, ie each occasion that unit P₁ is
shown closely adjacent to memory M₀ in Figure 5. Similarly, memory M₈ is
required to be interfaced to a second like device arranged to access it on
alternate subcycles for computation and return of rotation parameters. This
second like device is required to receive matrix element x_11 on the first
cycle of operation as indicated in Figures 1 and 3. It is also required to
receive subsequent row leading matrix elements x_n1 (n = 2, 3 ...). It will act
as the uppermost boundary cell 14 in Figure 1 to generate c and s rotation
parameters to be read as RE2 and RE3 by processing unit P₈ at the end of the
second cycle (V2 = 30). These devices are straightforward to implement in
practice. They will be processing devices similar to units P₁ to P₈ and
interfaced to respective memories M₀ and M₈ via the data and address buses
shown truncated in Figure 4.

The processor 40 of the invention incorporates processing units P₁ etc with
internal memory containing an address look-up table and a store for three
delayed values and fifteen coefficients in addition to a programme. It is also
possible to employ simpler processing devices with less internal memory
capacity. In this case, the memories M₀ etc might contain address lists and
value and coefficient stores, and be associated with counters for counting
through address lists of respective processing devices. This may however be
less convenient to implement and result in slower processing. This is because
commercially available discrete processing devices such as transputers
incorporate sufficient internal memory for the purposes of the processing units
P₁ etc, and it would be inefficient not to use such facilities. However, the
processor 40 might well be implemented as an integrated circuit chip or wafer
in which individual processing units, registers and memories become respective
areas of silicon or gallium arsenide. In this case the most convenient balance
between local and remote memory may be chosen.

The processor 40 is designed for the situation in which eight processing units
P₁ to P₈ carry out the functions of one hundred and twenty internal cells 12.
In general, a triangular sub-array having n internal cells per (non-diagonal)
outer edge has n(n+1)/2 cells. This number may be factorised either as n/2 and
(n+1) or as n and (n+1)/2. Since n is a positive integer, one of n and (n+1)
must be an even number. Consequently, n(n+1)/2 can always be factorised to two
whole numbers, one of which may be treated as the number of processing units
and the other the number of internal cells allocated to each processing unit.
However, it may be necessary for there to be an odd number of processing
units, as opposed to the even number (eight) employed in the processor 40.
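
The factorisation argument can be made concrete with a short calculation. The function below is a hypothetical helper, and it makes one choice the text leaves open: it always treats the halved factor as the number of processing units, whereas the text allows either factor to play that role.

```python
def partition(n):
    """Split the n(n+1)/2 cells of an n x n triangular sub-array into
    (processing units, cells per unit, total cells)."""
    total = n * (n + 1) // 2
    if n % 2 == 0:
        units, cells_per_unit = n // 2, n + 1       # n/2 and (n+1)
    else:
        units, cells_per_unit = (n + 1) // 2, n     # (n+1)/2 and n
    return units, cells_per_unit, total


processor_40 = partition(15)    # eight units, fifteen cells each, 120 cells
odd_unit_case = partition(13)   # seven units, thirteen cells each, 91 cells
```

For n = 15 this reproduces the eight units and one hundred and twenty cells of the processor 40, with fifteen cells (and hence fifteen subcycles per cycle) for each unit.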

Referring to Figure 10, there is shown an alternative form of processor of the
invention, this being indicated generally by 140 and incorporating an odd
number (seven) of processing units. Parts in Figure 10 equivalent to those
illustrated in Figure 4 have like reference characters P, M, D or R with
asterisks. Subscript indices are changed to run from 1 to 7 instead of 1 to 8.
The processor 140 is very similar to that described earlier, and will not be
described in detail. Instead, it is observed that the only substantial
difference between the processor 140 and the earlier embodiment is that the
former has no direct equivalent of processing unit P₄. It has no direct
equivalents of M₄, D₄ and R₄₁ in consequence. Units P₄* to P₇* are in fact
equivalent to units P₅ to P₈ respectively.

Figure 11 shows the relative phasing of operation of the processing units P₁*
etc in terms of V1 and V2 values as before. It is referenced equivalently to
Figure 5, and shows that the processor 140 performs the function of a 13 x 13
triangular sub-array; ie n(n+1)/2 is 13 x 7 or 91. Each of the seven processing
units P₁* to P₇* corresponds to thirteen internal cells as shown in the
drawing. There are accordingly thirteen subcycles per cycle. In other respects,
the processor 140 operates equivalently to the earlier embodiment and will not
be described in detail.






Comparison of the regular structures of Figures 4 and 11 demonstrates that the
invention may be constructed in modular form by cascading integrated circuit
chips. Each chip could contain two (or more) processing units such as P₂ and
P₈ together with their associated registers and memories M₂ and M₈ etc.
Processing units surplus to requirements on part-used chips would be bypassed.
The processors 40 and 140 each employ one more memory M₀/M₀* than there are
processing units P₁ etc. This may be accommodated by the use of an extra
external memory rather than a largely bypassed integrated circuit.
Alternatively, it is possible to omit M₀ and connect buses A₁/B₁ to M₈. This
provides for units P₁ and P₈ together with rotation parameter computing means
(previously mentioned) to address a common memory M₈. Similar remarks apply to
M₀*/M₇*. It may constitute a cumbersome alternative, since it imposes
substantial access requirements on memory M₈ or M₇*.

The foregoing discussion was directed to the use of n/2 or (n+1)/2 processing
units to carry out the function of an n x n triangular array of n(n+1)/2
processing units. This may frequently be an optimum implementation, since it
combines a substantial reduction in the number of processing units required
with a comparatively high degree of parallelism. It should be at least n/2
times faster than a single computer carrying out the whole computation, while
employing 1/(n+1) of the number of processing units required for a fully
parallel array employing one unit per node as in Figure 1. However, the
invention is not restricted to n/2 or (n+1)/2 processing units simulating an
n x n triangular array. Figure 12 illustrates appropriate phasing of operation
for four processing units simulating a 16 x 16 triangular array. V1 and V2
values up to 68 are given.






A processor of the invention may be arranged to simulate both non-triangular
systolic arrays and also arrays in which there are processing cells with
differing computational functions. Individual cells may have more than one such
function, eg a cell may switch between two computational functions on
successive subcycles. For most purposes, however, such an arrangement might be
undesirably complex.

CLAIMS

1. A digital data processor for simulating operation of a parallel processing
array, the processor (40) including an assembly of digital processing devices
(P₁ to P₈) connected to data storing means (M₀ to M₈, R₁₁ etc), characterised
in that:-

(a) each processing device (P₁ to P₈) is programmed to implement a respective
list (parts of three lists are shown in Table 1) of sets of storing means data
addresses (eg M₁12);

(b) each address set (eg M₁12, M₀20, M₀19, M₀17, M₀16, M₀15) contains input
data addresses (eg M₁12) and output data addresses (eg M₀17) which differ, and
each such set corresponds to data input/output functions of a respective
simulated array cell (12);

(c) each list of address sets corresponds to a respective sub-array of cells
(12) of the simulated array, and each such list contains pairs of successive
address sets (eg M₁12, M₀20, M₀19, M₀17, M₀16, M₀15) in which the leading
address sets have input data addresses (eg M₁12) like to output data addresses
of respective successive address sets, each list being arranged to provide for
operations associated with simulated cells (12) to be executed in reverse order
to that corresponding to data flow (22, 24) through the simulated array; and

(d) each processing device (P₁ to P₈) is programmed to employ a respective
first address set (eg M₁12, M₀20, M₀19, M₀17, M₀16, M₀15) to read input data
from and write output data to the data storing means (M₀ to M₈, R₁₁ etc), the
output data being generated in accordance with a computational function, to
employ subsequent address sets (eg M₁17, M₀22, M₀21, M₁12, M₁14, M₁13) in a
like manner until the list is complete, and then to repeat this procedure
cyclically.

2. A digital processor according to Claim 1 characterised in that each
processing device (P₁ to P₈) is arranged to communicate with not more than four
other processing devices (P₁ to P₈), and incorporating storing means including
register devices (eg R₁₁) and memories (eg M₁) connected between respective
pairs of processing devices (P₁/P₈, P₁/P₂), the address set lists being such
that each register device (eg R₁₁) and memory (eg M₁) is addressed by not more
than one processing device (eg P₁) at a time.

3. A digital processor according to Claim 2 characterised in that some of the
processing devices (P₂ to P₇) are arranged to communicate with two of the other
processing devices (P₁ to P₈) via respective register devices (eg R₃₁ to R₃₄)
and with a further two of the other processing devices (P₁ to P₈) via
respective memories (eg M₃), and wherein the address set lists are arranged
such that the register devices (eg R₃₁ to R₃₄) are addressed less frequently
than the memories (eg M₃).

4. A digital processor according to Claim 1, 2 or 3 characterised in that some
of the processing devices (P₅ to P₈) include input means (I/O) arranged for
parallel to serial conversion of input data elements.

5. A digital processor according to any preceding claim characterised in that
each processing device (P₁ to P₈) is arranged to store and update a respective
coefficient in relation to each address set (eg M₁12, M₀20, M₀19, M₀17, M₀16,
M₀15) in its list.